Work Description
I am working with a large volume of data from water temperature loggers. Each logger records a time temperature observation every quarter-hour, and is saved as a csv file.
Each logger time series needs to be inspected and have parts of the time series removed (e.g., if the logger is exposed to air instead of water). I would like to do this task in an R script rather than manually altering the csv files.
I would like to have an R script that can perform the following tasks:
Use a function to import multiple csv files from the same directory, and collate them all into a single dataframe.
Use a separate data table with defined time periods of “flagged data” to identify flagged data in each time series
For example, the “flagged data” table would identify the time periods
from “2022-08-01 05:15:00” to “2022-08-05 05:30:00”
from “2022-09-02 05:30:00” to “2022-09-05 06:45:15”
The script would mutate a new “flagged_data” column to identify flagged data from these time periods
Multiple time periods would be identified as flagged data, from multiple loggers
I have placed three example csv files in the “input” folder of this repository. I have also placed an example “temp_logger_flagged_data.csv” file in the same folder.
If possible, it is my preference that the script is based in tidyverse functions, as my colleagues are more familiar with this language. If it is not possible to do this, that is OK.
It is my preference that we accomplish this task through GitHub. If hired, please make a “pull request” to the example repository.
Thank you!
Basic Example Time series (for one logger)
# load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# read in csv
temp_dat <- read_csv("input/csv/21235341.csv", col_types = cols("i", "c", "d", "?", "_", "_", "_"), skip = 1) %>%
select(starts_with(c("Date","Temp")))
colnames(temp_dat) <- c("date_time","temp_C")
temp_dat %<>%
transform(date_time = mdy_hms(date_time))
# plot
temp_dat %>%
ggplot(aes(date_time,temp_C)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
From the above plot, we can see that the logger is exposed to air rather than water; we would like to flag the data between approximately 7/4/2022 12:00:00 to 7/6/2022 12:00:00.
Additional Example
# read in, combine, and prepare multiple csv files
# specify directory of cvs files
dir <- "input/csv/"
# read in and combine all csv files
temp_dat <- list.files(path = dir,
pattern="*.csv",
full.names = T) %>%
map_df(~read_csv(., skip = 1)) %>%
remove_empty() %>%
select(starts_with(c("Date","Temp"))) %>%
pivot_longer(cols = starts_with("Temp"),
names_to = "logger_id",
values_to = "temp_C") %>%
filter(!is.na(temp_C)) %>%
mutate(logger_id = str_sub(logger_id,-9,-2)) %>%
rename(date_time = `Date Time, GMT-08:00`) %>%
# convert date time to compatible class
transform(date_time = mdy_hms(date_time))
## value for "which" not specified, defaulting to c("rows", "cols")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 9606 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 10816960), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 10816960, SEN S/N: 10816960)
## lgl (1): Coupler Attached (LGR S/N: 10816960)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 43434 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 20012591), End Of ...
## dbl (2): #, Temp, °C (LGR S/N: 20012591, SEN S/N: 20012591)
## lgl (2): Coupler Attached (LGR S/N: 20012591), Host Connected (LGR S/N: 2001...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 6906 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 20625008), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 20625008, SEN S/N: 20625008)
## lgl (1): Coupler Attached (LGR S/N: 20625008)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 6983 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21235340), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21235340, SEN S/N: 21235340)
## lgl (1): Coupler Attached (LGR S/N: 21235340)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 6990 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21235341), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21235341, SEN S/N: 21235341)
## lgl (1): Coupler Attached (LGR S/N: 21235341)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8825 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21235343), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21235343, SEN S/N: 21235343)
## lgl (1): Coupler Attached (LGR S/N: 21235343)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 10362 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444843), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444843, SEN S/N: 21444843)
## lgl (1): Coupler Attached (LGR S/N: 21444843)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 7009 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444844), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444844, SEN S/N: 21444844)
## lgl (1): Coupler Attached (LGR S/N: 21444844)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8821 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444869), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444869, SEN S/N: 21444869)
## lgl (1): Coupler Attached (LGR S/N: 21444869)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 10357 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444870), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444870, SEN S/N: 21444870)
## lgl (1): Coupler Attached (LGR S/N: 21444870)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 6986 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444872), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444872, SEN S/N: 21444872)
## lgl (1): Coupler Attached (LGR S/N: 21444872)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 9691 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444873), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444873, SEN S/N: 21444873)
## lgl (1): Coupler Attached (LGR S/N: 21444873)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 10361 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21444874), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21444874, SEN S/N: 21444874)
## lgl (1): Coupler Attached (LGR S/N: 21444874)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 5078 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date Time, GMT-08:00, Coupler Detached (LGR S/N: 21488145), Host Co...
## dbl (2): #, Temp, °C (LGR S/N: 21488145, SEN S/N: 21488145)
## lgl (1): Coupler Attached (LGR S/N: 21488145)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The above chunk appears to be functional in reading in and combining multiple csvs. However I, would appreciate any advice on how to make it more efficient, and double-checking to see that it is not making some kind of systematic error.
Next, we will visualize logger time series one at a time, and record the parts of each time series that we would like to remove in the file located at “input/temp_logger_flagged_data.csv”. An example logger time series is shown below.
# plot
logger <- "21444843"
ggplotly(
p <- temp_dat %>%
filter(logger_id == logger) %>%
ggplot(aes(date_time,temp_C)) +
geom_point()
)
Next, we will read in a table of time periods for each logger that we would like flag for removal. We will mutate a a column titled “useData”, with “1” indicating OK data, and “0” indicating bad data.
# read in file of visually identified flagged data
flagged_data <- read.csv("input/temp_logger_flagged_data.csv", sep = ",") %>%
transform(date_time_start = mdy_hm(date_time_start),
date_time_stop = mdy_hm(date_time_stop)) %>%
transform(logger_id = as.character(logger_id)) %>%
select(-notes,-X) %>%
filter(!is.na(logger_id))
## Warning: 1 failed to parse.
# from the dataframe of all water temp data, add a new column of flagged data; with "useData = 0" for flagged time periods
flagged_data_df <- temp_dat %>%
mutate(useData = case_when(
date_time >= flagged_data$date_time_start &
date_time <= flagged_data$date_time_stop &
logger_id == flagged_data$logger_id ~ 0
))
## Warning in `>=.default`(date_time, flagged_data$date_time_start): longitud de
## objeto mayor no es múltiplo de la longitud de uno menor
## Warning in `<=.default`(date_time, flagged_data$date_time_stop): longitud de
## objeto mayor no es múltiplo de la longitud de uno menor
## Warning in logger_id == flagged_data$logger_id: longitud de objeto mayor no es
## múltiplo de la longitud de uno menor
Here, we encounter an error, and code does not behave as expected. The above chunk will run, but the following error is generated:
Warning: There were 3 warnings in mutate(). The first
warning was: ℹ In argument: useData = case_when(...).
Caused by warning in >=.default: ! longer object length
is not a multiple of shorter object length ℹ Run
dplyr::last_dplyr_warnings() to see the 2 remaining warnings.
The resulting dataframe, “flagged_data_df” does not flag all data appropriately within the specified time windows, only partially. Reasons foor this error are unclear.